About Me

Professional Background

💼 Career
|- EY - Manager, Valuation, Modeling, & Economics
|- PG&E - Supervisor, Capital Recovery & Analysis
|- KPMG - Sr Manager, Economics & Valuation
|- Centene - Data Scientist III, Strategic Insights
|- Bloomreach - Sr Manager, Data Ops & Analytics
|- Centene - Lead Machine Learning Engineer

📚 Education
|- Georgia Tech - BS Management
|- UC Irvine - MS Business Analytics

Technical Skills Gained

💼 Career
|- EY - Microsoft Excel, Microsoft Access, Financial Modeling
|- PG&E - SQL, ODBC (connecting Access / Excel to EDWs)
|- KPMG - VBA scripting, Excel add-ins (Power Query and Power BI)
|- Centene - R, Web Apps, Package Dev, ML / AI, Cloud DS Tools
|- Bloomreach - GCP, Amazon Redshift, Linux, Google Workspace
|- Centene - Docker, k8s, Databricks, Linux, bash, GenAI + LLMs

📚 Education
|- Georgia Tech - Finance, Business Management
|- UC Irvine - Business Analytics, Data Science

Data Science Issues

AI… It’s an umbrella

AI… It’s a growing umbrella

Classical ML… Another umbrella

ML Issues

  • How should I measure my baseline?

  • Which ML algorithm(s) should I use?

  • How should I split my training and testing data?

  • How should I evaluate my model fit?

  • Which performance evaluation metric(s) should I use?

  • Every ML package has a unique set of functions and arguments

and the most annoying issue…


Data sucks!

  • Missing data (NAs, nulls, etc.)
    • You may need to impute missing values
  • Categorical variables
    • ML algorithms might require dummy or one-hot encoding
  • Data type issues
    • For example, character strings and numbers in the same column
  • Too many variables
    • You may need feature selection or dimensionality reduction
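
It helps to see these problems in the raw data before modeling. A minimal sketch (assuming the Titanic data used later in this talk) that surfaces each issue with plain dplyr:

```r
library(dplyr)
library(titanic)

titanic_train |>
  summarise(
    missing_age  = sum(is.na(Age)),   # missing values you may need to impute
    n_sex_levels = n_distinct(Sex),   # a categorical variable to encode
    ticket_type  = class(Ticket)      # letters and numbers mixed in one character column
  )
```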

About Tidymodels

What is Tidymodels?


  • A collection of R packages for reproducible ML

  • Follows tidy principles:
    • Consistent interface
    • Human-readable code
    • Reproducible workflows

  • Provides a unified syntax for ML
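
For example, the same {parsnip} verbs specify very different algorithms. A small sketch (the {ranger} engine is just one illustrative choice):

```r
library(parsnip)

# A logistic regression and a random forest, declared with identical grammar
logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

rand_forest(trees = 1000) |>
  set_engine("ranger") |>
  set_mode("classification")
```

Only the function and engine names change; splitting, recipes, fitting, and evaluation all stay the same.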

The ML Workflow (with Tidymodels)

Let’s Build a Classification Model!

Our tidymodels workflow follows these steps:

  1. Load libraries and data
  2. Split data (training vs testing)
  3. Create recipe for preprocessing
  4. Specify model
  5. Create workflow
  6. Train model
  7. Evaluate performance
  8. Visualize model performance
  9. Create setup for hyperparameter tuning
  10. Create cross-validation folds
  11. Define the tuning grid
  12. Tune the model
  13. Visualize tuning results
  14. Select the best model
  15. Final fit
  16. Variable importance


Step 0: Load Libraries & Data

Prediction problem: Predict survival of Titanic passengers

# Load necessary packages
library(tidyverse)
library(tidymodels)
library(titanic)

# Load and prepare data
data(titanic_train)
titanic_data <- as_tibble(titanic_train) |> 
  mutate(Survived = factor(Survived, levels = c(0, 1)))  # Convert the target variable to a factor

# Take a look at the data
glimpse(titanic_data)
Rows: 891
Columns: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived    <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

Step 1a: Split the Data

# Split the data (create a singular object containing training and testing splits)
set.seed(123)
titanic_split <- initial_split(
  data = titanic_data, 
  prop = 0.75, 
  strata = Survived # stratify split by Survived column
)

# Create training and testing datasets
train_data <- training(titanic_split)
test_data <- testing(titanic_split)

Step 1b: Take a glimpse() at the splits

# Use dplyr::glimpse() to review the training split
glimpse(train_data)
Rows: 667
Columns: 12
$ PassengerId <int> 6, 7, 8, 13, 14, 15, 19, 21, 25, 27, 28, 31, 36, 38, 41, 4…
$ Survived    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Pclass      <int> 3, 1, 3, 3, 3, 3, 3, 2, 3, 3, 1, 1, 1, 3, 3, 2, 3, 3, 3, 3…
$ Name        <chr> "Moran, Mr. James", "McCarthy, Mr. Timothy J", "Palsson, M…
$ Sex         <chr> "male", "male", "male", "male", "male", "female", "female"…
$ Age         <dbl> NA, 54, 2, 20, 39, 14, 31, 35, 8, NA, 19, 40, 42, 21, 40, …
$ SibSp       <int> 0, 0, 3, 0, 1, 0, 1, 0, 3, 0, 3, 0, 1, 0, 1, 1, 0, 0, 1, 2…
$ Parch       <int> 0, 0, 1, 0, 5, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Ticket      <chr> "330877", "17463", "349909", "A/5. 2151", "347082", "35040…
$ Fare        <dbl> 8.4583, 51.8625, 21.0750, 8.0500, 31.2750, 7.8542, 18.0000…
$ Cabin       <chr> "", "E46", "", "", "", "", "", "", "", "", "C23 C25 C27", …
$ Embarked    <chr> "Q", "S", "S", "S", "S", "S", "S", "S", "S", "C", "S", "C"…

Step 1c: Check the Stratification

# Review the proportion of 0s and 1s in each split
train_data |> 
  summarise(Train_Rows = n(), .by = Survived) |> 
  mutate(Train_Percent = Train_Rows / sum(Train_Rows)) |> 
  left_join(test_data |> 
    summarise(Test_Rows = n(), .by = Survived) |> 
    mutate(Test_Percent = Test_Rows / sum(Test_Rows)),
  join_by(Survived))
# A tibble: 2 × 5
  Survived Train_Rows Train_Percent Test_Rows Test_Percent
  <fct>         <int>         <dbl>     <int>        <dbl>
1 0               411         0.616       138        0.616
2 1               256         0.384        86        0.384

Step 2: Create a Modeling Recipe

# Create a pre-processing recipe
titanic_recipe <- recipe(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare, 
                         data = train_data) |>
  step_impute_median(Age) |>               # Handle missing values in Age
  step_dummy(all_nominal_predictors()) |>  # Convert categorical variables to dummy variables
  step_normalize(all_numeric_predictors()) # Normalize numeric predictors

titanic_recipe

Step 3: Specify the Model

With {parsnip}, you have a unified interface for ML:

# Specify a logistic regression model
log_model <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

log_model
Logistic Regression Model Specification (classification)

Computational engine: glm 

Step 4: Create a Workflow

# Create a workflow
titanic_workflow <- workflow() |>
  add_recipe(titanic_recipe) |>
  add_model(log_model)

titanic_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_impute_median()
• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Computational engine: glm 

Step 5: Train the Model

# Fit the workflow to the training data and score the test data (last_fit() does both)
titanic_fit <- last_fit(titanic_workflow, titanic_split)

titanic_fit
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [667/224]> train/test split <tibble> <tibble> <tibble>     <workflow>

Step 6: Evaluate the Model

# Collect metrics from the `last_fit()` model object
collect_metrics(titanic_fit)
# A tibble: 3 × 4
  .metric     .estimator .estimate .config             
  <chr>       <chr>          <dbl> <chr>               
1 accuracy    binary         0.804 Preprocessor1_Model1
2 roc_auc     binary         0.878 Preprocessor1_Model1
3 brier_class binary         0.132 Preprocessor1_Model1

Step 7: Visualize Model Performance

# Plot confusion matrix
titanic_fit |>
  collect_predictions() |> 
  conf_mat(truth = Survived, estimate = .pred_class) |> 
  autoplot(type = "heatmap")
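
The same collected predictions support other yardstick visuals. A sketch of an ROC curve (note: with Survived's factor levels c(0, 1), yardstick treats the first level, 0, as the event by default, hence the .pred_0 column):

```r
# Plot the ROC curve from the same predictions
titanic_fit |>
  collect_predictions() |>
  roc_curve(truth = Survived, .pred_0) |>
  autoplot()
```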

Step 8: Hyperparameter Tuning

Let’s try modeling with {glmnet} and tuning

# Create a tunable model specification
log_model_tune <- logistic_reg(
  penalty = tune(),
  mixture = tune()
) |>
  set_engine("glmnet") |>
  set_mode("classification")

# Create a tuning workflow
tune_workflow <- workflow() |>
  add_recipe(titanic_recipe) |>
  add_model(log_model_tune)

Step 9: Create Cross-Validation Folds

# Create cross-validation folds
set.seed(234)
titanic_folds <- vfold_cv(train_data, v = 5, strata = Survived)

titanic_folds
#  5-fold cross-validation using stratification 
# A tibble: 5 × 2
  splits            id   
  <list>            <chr>
1 <split [532/135]> Fold1
2 <split [534/133]> Fold2
3 <split [534/133]> Fold3
4 <split [534/133]> Fold4
5 <split [534/133]> Fold5

Step 10: Define the Tuning Grid

# Create a grid of hyperparameters to try
log_grid <- grid_regular(
  penalty(range = c(-3, 0)),  # range is on the log10 scale (the dials default)
  mixture(range = c(0, 1)),
  levels = c(4, 5)
)

log_grid
# A tibble: 20 × 2
   penalty mixture
     <dbl>   <dbl>
 1   0.001    0   
 2   0.01     0   
 3   0.1      0   
 4   1        0   
 5   0.001    0.25
 6   0.01     0.25
 7   0.1      0.25
 8   1        0.25
 9   0.001    0.5 
10   0.01     0.5 
11   0.1      0.5 
12   1        0.5 
13   0.001    0.75
14   0.01     0.75
15   0.1      0.75
16   1        0.75
17   0.001    1   
18   0.01     1   
19   0.1      1   
20   1        1   

Step 11: Tune the Model

# Tune the model
set.seed(345)
log_tuning_results <- tune_grid(
  tune_workflow,
  resamples = titanic_folds,
  grid = log_grid,
  metrics = metric_set(accuracy, roc_auc)
)

log_tuning_results
# Tuning results
# 5-fold cross-validation using stratification 
# A tibble: 5 × 4
  splits            id    .metrics          .notes          
  <list>            <chr> <list>            <list>          
1 <split [532/135]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
2 <split [534/133]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
3 <split [534/133]> Fold3 <tibble [40 × 6]> <tibble [0 × 3]>
4 <split [534/133]> Fold4 <tibble [40 × 6]> <tibble [0 × 3]>
5 <split [534/133]> Fold5 <tibble [40 × 6]> <tibble [0 × 3]>

Step 12: Visualize Tuning Results

# Show the best models
show_best(log_tuning_results, metric = "roc_auc")
# A tibble: 5 × 8
  penalty mixture .metric .estimator  mean     n std_err .config              
    <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1   0.1      0    roc_auc binary     0.843     5  0.0126 Preprocessor1_Model03
2   0.001    0    roc_auc binary     0.843     5  0.0121 Preprocessor1_Model01
3   0.01     0    roc_auc binary     0.843     5  0.0121 Preprocessor1_Model02
4   0.01     0.25 roc_auc binary     0.842     5  0.0114 Preprocessor1_Model06
5   0.01     0.75 roc_auc binary     0.842     5  0.0107 Preprocessor1_Model14

Step 12: Visualize Tuning Results (cont.)

# Create a visualization of the tuning results
autoplot(log_tuning_results)

Step 13: Select the Best Model

# Select the best hyperparameters
best_params <- select_best(log_tuning_results, metric = "roc_auc")

best_params
# A tibble: 1 × 3
  penalty mixture .config              
    <dbl>   <dbl> <chr>                
1     0.1       0 Preprocessor1_Model03
# Finalize the workflow with the best parameters
final_workflow <- finalize_workflow(tune_workflow, best_params)

Step 14: Final Fit

# Fit the final model to the entire training set and evaluate on test set
final_fit <- final_workflow |>
  last_fit(titanic_split)

# Get the metrics
collect_metrics(final_fit)
# A tibble: 3 × 4
  .metric     .estimator .estimate .config             
  <chr>       <chr>          <dbl> <chr>               
1 accuracy    binary         0.790 Preprocessor1_Model1
2 roc_auc     binary         0.872 Preprocessor1_Model1
3 brier_class binary         0.146 Preprocessor1_Model1

Step 15: Variable Importance

# Extract the fitted workflow
fitted_workflow <- final_fit |>
  extract_workflow()

# Extract the fitted model
fitted_model <- fitted_workflow |>
  extract_fit_parsnip()

# Calculate variable importance
vip::vip(fitted_model)

Tidymodels Benefits

  • Consistent Interface: Same syntax across different ML algorithms
  • Modularity: Each step is a separate function that can be modified
  • Reproducibility: Workflows capture the entire modeling process
  • Extensibility: Easy to add new steps or algorithms
  • Visualization: Built-in tools for visualizing results
  • Tuning: Streamlined process for hyperparameter optimization
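
Modularity in particular is easy to demonstrate: a workflow can swap its model while keeping the recipe. A sketch, assuming the titanic_workflow object from earlier and the {ranger} engine installed:

```r
# Swap the logistic regression for a random forest; the recipe is untouched
rf_model <- rand_forest(trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

rf_workflow <- titanic_workflow |>
  update_model(rf_model)
```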

Thank you! 🤍

Questions?



Connect with me!